Abstract

Consider a dataset of $n(d)$ points generated independently from $\mathbb{R}^d$ according to a
common p.d.f. $f_d$ with support($f_d$) = $[0,1]^d$ and $\sup\{f_d([0,1]^d)\}$ growing sub-exponentially
in $d$. We prove that: (i) if $n(d)$ grows sub-exponentially in $d$, then, for any query point
$q^d \in [0,1]^d$ and any $\epsilon > 0$, the ratio of the distance between any two dataset points and
$q^d$ is less than $1 + \epsilon$ with probability $\to 1$ as $d \to \infty$; (ii) if $n(d) > [4(1+\epsilon)]^d$ for large
$d$, then for all $q^d \in [0,1]^d$ (except a small subset) and any $\epsilon > 0$, the distance ratio is
less than $1 + \epsilon$ with limiting probability strictly bounded away from one. Moreover, we
provide preliminary results along the lines of (i) when $f_d = N(\mu_d, \Sigma_d)$.
Key words: information retrieval, curse of dimensionality 

1. Introduction 

Nearest neighbor search on high-dimensional data is a difficult (and well-studied) 
problem, in part, because many commonly used distance functions can exhibit greatly 
different behavior in low versus high-dimensional spaces - a phenomenon often referred
to as the "curse of dimensionality". In an effort to rigorously analyze this phenomenon,
Beyer et al. [3] defined a nearest neighbor query with respect to a reference query point
$q^d \in \mathbb{R}^d$ as unstable if all of the points in the dataset are nearly the same distance from
$q^d$. In this event, the query can be thought meaningless since there is little reason to
return any one point over another (see figure 2 in [3]). Beyer et al. (then later others
[11]) established sufficient conditions on the data generation distributions and dataset
sizes under which the probability of query instability approaches one as $d \to \infty$. Such
conditions provide useful insight into how the curse can be mitigated or must be tolerated
as unavoidable. We develop a new set of sufficient conditions which improve upon the
current ones; see sub-sections 1.2 and 1.3 for a description of our contributions and their
relationship to the literature.



1.1. Notations and Definitions 

Given $n(\cdot) : \mathbb{N} \to \mathbb{N}$, we represent a $d$-dimensional, size $n(d)$ dataset with i.i.d. random
vectors $Y_1, \ldots, Y_{n(d)}$ having common p.d.f. $f_d$. Let support($f_d$) denote the topological
closure of $\{y \in \mathbb{R}^d : f_d(y) > 0\}$. Given positive real number $p$, the distance between a
pair of points $z, w \in \mathbb{R}^d$ is defined as:

$$\|z - w\|_p = \left( \sum_{j=1}^{d} |z_j - w_j|^p \right)^{1/p}.$$

Given $\epsilon > 0$, the probability of a nearest neighbor query $q^d \in$ support($f_d$) being unstable is

$$P_{d,n(\cdot),q^d} = \Pr\left[ \max_{i=1}^{n(d)} \{\|Y_i - q^d\|_p\} \le (1+\epsilon) \min_{i=1}^{n(d)} \{\|Y_i - q^d\|_p\} \right].$$



The space of all possible query point sequences is $\prod_{d=1}^{\infty}$ support($f_d$). We say that
data distribution sequence $\{f_d : d = 1, 2, \ldots\}$ and dataset size function $n(\cdot)$ admit
nearest neighbor instability if for any $\epsilon > 0$ and any query point sequence $\{q^d\} \in
\prod_{d=1}^{\infty}$ support($f_d$), it is the case that $\lim_{d\to\infty} P_{d,n(\cdot),q^d} = 1$. We say that $\{f_d\}$ and $n(\cdot)$
strongly fail to admit nearest neighbor instability if there exists $\zeta < 1$ and a "large"
$Q \subseteq \prod_{d=1}^{\infty}$ support($f_d$), such that for any $\epsilon > 0$ and for any $\{q^d\} \in Q$, it is the case
that $\lim_{d\to\infty} P_{d,n(\cdot),q^d} \le \zeta$. Let $Q^d$ denote the $d$th component of $Q$. We say that $Q$ is
"large" if for any $0 < \omega < 1$, it is the case that
$\lim_{d\to\infty} \frac{\omega^d \, Volume(support(f_d))}{Volume(Q^d)} = 0$. Note,
if support($f_d$) = $[0,1]^d$, this last condition is equivalent to $\lim_{d\to\infty} \frac{Volume([0,\omega]^d)}{Volume(Q^d)} = 0$.

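A function $g : \mathbb{N} \to \mathbb{N}$ is said to grow sub-exponentially if $\lim_{d\to\infty} \frac{\log(g(d))}{d} = 0$. A
sequence of functions, $f_d : \mathbb{R}^d \to \mathbb{R}$; $d = 1, 2, \ldots$, is said to be bounded above
sub-exponentially if there is a sub-exponentially growing $g$ such that, for all $d$, $\sup\{f_d(\mathbb{R}^d)\} \le g(d)$.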
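
As an informal illustration of the instability probability defined above, it can be estimated by simulation. The sketch below is not part of the analysis: it assumes data uniform on $[0,1]^d$, places the query at the centre of the cube, and uses hypothetical helper names (`p_norm_distance`, `estimate_instability`) and arbitrary parameter values.

```python
import random


def p_norm_distance(z, w, p):
    """Distance ||z - w||_p = (sum_j |z_j - w_j|^p)^(1/p) for real p > 0."""
    return sum(abs(a - b) ** p for a, b in zip(z, w)) ** (1.0 / p)


def estimate_instability(d, n, eps, p=2.0, trials=200, seed=0):
    """Monte Carlo estimate of Pr[max_i dist_i <= (1 + eps) * min_i dist_i].

    Assumptions made for illustration only: the n data points are uniform
    on [0,1]^d and the query point is the centre of the cube.
    """
    rng = random.Random(seed)
    query = [0.5] * d
    unstable = 0
    for _ in range(trials):
        dists = [
            p_norm_distance([rng.random() for _ in range(d)], query, p)
            for _ in range(n)
        ]
        if max(dists) <= (1.0 + eps) * min(dists):
            unstable += 1
    return unstable / trials


if __name__ == "__main__":
    # Fixed dataset size, growing dimension.
    for d in (2, 20, 200):
        print(d, estimate_instability(d, n=20, eps=0.25))
```

With the dataset size held fixed, the estimate would be expected to move toward one as $d$ grows, in line with result (i); the simulation only suggests this behavior and proves nothing.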

1.2. Our Contributions 

For any $\{f_d\}$ bounded above sub-exponentially and support($f_d$) = $[0,1]^d$, we prove
the following: (i) if $n(\cdot)$ grows sub-exponentially, then nearest neighbor instability is
admitted; (ii) if $n(d) > [4(1+\epsilon)]^d$ for large $d$, then (with $p > 1$) instability strongly fails
to be admitted. Moreover, we describe preliminary results toward establishing sufficient
conditions under which $\{N(\mu_d, \Sigma_d)\}$ admits instability.

1.3. Related Work



Beyer et al. [3] established sufficient conditions upon $n(\cdot)$ and $\{f_d\}$ for the admission
of nearest neighbor instability. They proved¹ that instability is admitted if $n(\cdot)$ is constant
and $\{f_d\}$ satisfies: $\lim_{d\to\infty} Var\left[ \frac{\|Y_1 - q^d\|_p^p}{E[\|Y_1 - q^d\|_p^p]} \right] = 0$, for any $\{q^d\}$ (the relative variance goes
to zero). Pestov [11] proved² that (Corollary 5.5) instability is admitted (except for a
small set of query point sequences) if $n(\cdot)$ is sub-exponentially growing and $\{f_d\}$ satisfies
three conditions, most notably, $\{f_d\}$ forms a normal Levy family as defined with respect
to the "concentration of measure" phenomena. Francois et al. [J] proved that instability
is admitted (with $\{q^d\} = \{0\}$) if $n(\cdot)$ is constant and each distribution in $\{f_d\}$ has i.i.d.
attributes with mean and variance not dependent on $d$.

Our contributions significantly advance the above results as follows. Our sufficient 
conditions allow n(.) to grow with d (unlike Beyer et al. and Francois et al.), are quite 
broad (unlike Francois et al. who require the data distributions to have i.i.d. attributes), 
and are easy to interpret (unlike Beyer et al. or Pestov et al. which leave open the 
question of which data distribution sequences satisfy the relative variance condition or 
normal Levy condition, respectively). Moreover, we provide results showing that the sub- 
exponential growth assumption on n(.) is strongly necessary: if n(.) grows exponentially, 
then instability fails to be admitted for a large space of query point sequences. Finally, we 
provide preliminary results toward establishing sufficient conditions for $\{N(\mu_d, \Sigma_d)\}$. To
our knowledge, the sufficient conditions for this distribution sequence remain unknown. 

Aggarwal et al. [2] considered distance functions with $p$ a positive integer and proved
that, for constant $n(\cdot) = N$ and data distributions with i.i.d. attributes supported on
(0,1),

$$C_p \le \lim_{d\to\infty} E\left[ \frac{\max_{i=1}^{N}\{\|Y_i\|_p\} - \min_{i=1}^{N}\{\|Y_i\|_p\}}{d^{1/p - 1/2}} \right] \le (N-1)C_p,$$

with $C_p$ a constant not
dependent on $d$. They argued that high-dimensional nearest neighbor behavior is sharply
different for each of the following three types of distance functions: $p = 1$, $p = 2$, and
$p \ge 3$. However, unlike our contributions, they do not provide sufficient conditions on
instability and they make the restrictive i.i.d. data attribute assumption. Hsu and Chen
[7] proved³ that, for constant $n(\cdot)$, the relative variance condition of Beyer is a necessary
as well as a sufficient condition for instability admission. They go on to develop a basis
for empirically testing whether instability is exhibited.

¹ They considered any non-negative distance function and did not restrict query points to reside in
support($f_d$).
² He considered any metric distance function.
³ They consider any non-negative distance function.

Shaft and Ramakrishnan [12] considered the related problem of analytically quantifying
the inherent limits of nearest-neighbor indexing on high-dimensional data. They
proved that, under conditions related to those in Beyer et al., the performance of a broad
class of index structures approaches that of linear scan as $d \to \infty$. In the stochastic geometry
literature, Zanger [13] studied the behavior of a general class of clustering functions
as $d \to \infty$ and established a connection to the concentration of measure phenomenon. A
more broadly studied problem in this literature is the behavior of nearest neighbor structures
as the dataset size goes to infinity and $d$ remains constant. For example, Penrose
[10] considered data generated i.i.d. from a continuous p.d.f. with compact support (and
"smooth" boundary) and showed that, as $N \to \infty$, the distance of any point to its $k$th
nearest neighbor converges, almost surely, to a constant not dependent on $N$.

A vast literature exists on the development of data structures and algorithms for 
nearest neighbor search, for brevity, see the discussion and citations in Q ■ 



2. Instability Results 



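First we develop a lower-bound on $P_{d,n(\cdot),q^d}$ making no assumptions on $\{f_d\}$ or $n(\cdot)$.
Define $\delta(\epsilon,p) = [(1+\epsilon)^p - 1]/[(1+\epsilon)^p + 1]$ and let $\gamma > 0$. If for all $1 \le i \le n(d)$,
$\left| \|Y_i - q^d\|_p^p - \gamma \right| \le \gamma\delta(\epsilon,p)$, then

$$\frac{\max_{i=1}^{n(d)} \{\|Y_i - q^d\|_p^p\}}{\min_{i=1}^{n(d)} \{\|Y_i - q^d\|_p^p\}} \le \frac{\gamma[1 + \delta(\epsilon,p)]}{\gamma[1 - \delta(\epsilon,p)]} = (1+\epsilon)^p.$$

Thus, $\max_{i=1}^{n(d)} \{\|Y_i - q^d\|_p\} \le \min_{i=1}^{n(d)} \{\|Y_i - q^d\|_p\}(1+\epsilon)$.
Using this and the fact that $Y_1, \ldots, Y_{n(d)}$ are i.i.d.,

$$P_{d,n(\cdot),q^d} \ge \Pr\left[ \forall i : \left| \|Y_i - q^d\|_p^p - \gamma \right| \le \gamma\delta(\epsilon,p) \right]
= \left[ 1 - \Pr\left[ \left| \|Y_1 - q^d\|_p^p - \gamma \right| > \gamma\delta(\epsilon,p) \right] \right]^{n(d)}. \qquad (1)$$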
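
The algebra behind the choice of $\delta(\epsilon,p)$ can be checked numerically. The following is a minimal sketch with hypothetical function names; the sampled "distances" are arbitrary numbers placed inside the band $[\gamma(1-\delta), \gamma(1+\delta)]$, not draws from any particular $f_d$.

```python
import random


def delta(eps, p):
    """delta(eps, p) = [(1 + eps)^p - 1] / [(1 + eps)^p + 1]."""
    a = (1.0 + eps) ** p
    return (a - 1.0) / (a + 1.0)


def worst_case_ratio(eps, p):
    """Largest max/min ratio of p-th powers allowed inside the band
    [gamma * (1 - delta), gamma * (1 + delta)]; gamma cancels out."""
    dl = delta(eps, p)
    return (1.0 + dl) / (1.0 - dl)


def band_implies_instability(eps, p, gamma, n, seed=0):
    """Place n hypothetical p-th-power distances inside the band and check
    that the distances themselves satisfy max <= (1 + eps) * min."""
    rng = random.Random(seed)
    dl = delta(eps, p)
    powers = [gamma * (1.0 + dl * rng.uniform(-1.0, 1.0)) for _ in range(n)]
    dists = [x ** (1.0 / p) for x in powers]
    # Small relative tolerance to absorb floating-point rounding.
    return max(dists) <= (1.0 + eps) * min(dists) * (1.0 + 1e-12)
```

Here `worst_case_ratio(eps, p)` should agree with `(1 + eps) ** p` up to rounding, which is the identity used to pass from the band condition to the instability event.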



Our results are reduced to upper-bounding the probability that a sum of random variables,
$\|Y_1 - q^d\|_p^p$, deviates significantly from a fixed value $\gamma$. To our knowledge, developing a
useful bound in the most general case is not possible. To get around this problem, we show
how our assumptions on $\{f_d\}$ and $n(\cdot)$ allow the dependences between the components
of $(Y_1 - q^d)$ to be broken, and thus, open the door to applying standard concentration
bounds (e.g. Hoeffding) on the r.h.s. of (1).

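Assume $\{f_d\}$ is bounded above sub-exponentially and support($f_d$) = $[0,1]^d$. Let
$U_1, \ldots, U_d$ be i.i.d. and distributed uniformly on [0,1]. Let $S$ denote
$\{y \in [0,1]^d : \left| \|y - q^d\|_p^p - \gamma \right| > \gamma\delta(\epsilon,p)\}$. There exists sub-exponentially growing function $\beta(\cdot)$ such
that,

$$\begin{aligned}
\Pr\left[ \left| \|Y_1 - q^d\|_p^p - \gamma \right| > \gamma\delta(\epsilon,p) \right]
&= \int_{y \in S} f_d(y)\,dy \\
&\le \beta(d) \int_{y \in S} dy \\
&= \beta(d) \Pr\left[ \left| \sum_{j=1}^{d} |U_j - q_j^d|^p - \gamma \right| > \gamma\delta(\epsilon,p) \right] \\
&\le \beta(d)\, 2\exp\left( -2d \left( \frac{\delta(\epsilon,p)}{(p+1)2^p} \right)^2 \right).
\end{aligned}$$

The first equality and inequality follow from the fact that support($f_d$) = $[0,1]^d$ and
$f_d$ is bounded above sub-exponentially, respectively. The second inequality follows from
Theorem 2 of Hoeffding [6].⁴ Plugging this bound into the r.h.s. of inequality (1) yields an
expression which goes to one as $d \to \infty$, due to the sub-exponential growth assumptions
on $n(\cdot)$ and $\beta(\cdot)$.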
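
To get a feel for how the Hoeffding step combines with inequality (1), the resulting lower bound on the instability probability can be evaluated numerically. This is a rough sketch under stated assumptions: the function names are hypothetical, `beta` defaults to 1 (which corresponds to a density bounded by one, such as the uniform density), and the closed form for $E[|U_j - q_j^d|^p]$ is cross-checked with a simple midpoint rule.

```python
import math


def delta(eps, p):
    """delta(eps, p) = [(1 + eps)^p - 1] / [(1 + eps)^p + 1]."""
    a = (1.0 + eps) ** p
    return (a - 1.0) / (a + 1.0)


def expected_abs_power(q, p):
    """Closed form of E|U - q|^p for U uniform on [0,1] and 0 <= q <= 1."""
    return (q ** (p + 1) + (1.0 - q) ** (p + 1)) / (p + 1.0)


def expected_abs_power_numeric(q, p, steps=100000):
    """Midpoint-rule approximation of the same expectation (cross-check)."""
    h = 1.0 / steps
    return h * sum(abs((k + 0.5) * h - q) ** p for k in range(steps))


def hoeffding_tail(d, eps, p):
    """2 * exp(-2 * d * (delta / ((p + 1) * 2^p))^2)."""
    c = (p + 1.0) * 2.0 ** p
    return 2.0 * math.exp(-2.0 * d * (delta(eps, p) / c) ** 2)


def instability_lower_bound(d, n, eps, p, beta=1.0):
    """[1 - beta * tail]^n as in inequality (1); 0.0 when the bound is vacuous."""
    tail = beta * hoeffding_tail(d, eps, p)
    if tail >= 1.0:
        return 0.0
    # log1p keeps precision when tail is tiny and n is large.
    return math.exp(n * math.log1p(-tail))
```

For example, with the hypothetical choice `n = d ** 2`, `instability_lower_bound(d, d ** 2, 0.5, 1.0)` is vacuous for small `d` and approaches one only once `d` is large, which reflects how loose the constant in the exponent is.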



3. Dataset Size Assumption 

Now we relax the assumption that $n(\cdot)$ grows sub-exponentially while still assuming
that $\{f_d\}$ is bounded above sub-exponentially and support($f_d$) = $[0,1]^d$. Suppose that,
for large $d$, $n(d) > [4(1+\epsilon)]^d$. We further assume that $p > 1$. Our goal in this section is
to show that $\{f_d\}$ and $n(\cdot)$ strongly fail to admit instability.



⁴ With $\gamma = \sum_{j=1}^{d} E[|U_j - q_j^d|^p]$, $X_j = |U_j - q_j^d|^p$, and $t = (\delta(\epsilon,p) \sum_{j=1}^{d} E[|U_j - q_j^d|^p])/d$. Clearly
$t > 0$. Also, since support($U_j$) = [0,1] and $q^d \in$ support($f_d$) = $[0,1]^d$, then $0 \le |U_j - q_j^d|^p \le 1$. Finally,
$E[|U_j - q_j^d|^p] = [(q_j^d)^{p+1} + (1 - q_j^d)^{p+1}]/(p+1)$ which, for $0 \le q_j^d \le 1$, obtains its minimum of $1/[(p+1)2^p]$
at $q_j^d = 1/2$.

Fix $99/100 < \zeta < 1$ and define $Q^d$ as $\{q^d \in [0,1]^d : \Pr[\max_{i=1}^{n(d)} \|q^d - Y_i\|_p - (1+\epsilon)\min_{i=1}^{n(d)} \|q^d - Y_i\|_p > 0] > 1 - \zeta\}$
and $Q = \prod_{d=1}^{\infty} Q^d$. Clearly, for any $\{q^d\} \in Q$,
$\lim_{d\to\infty} P_{d,n(\cdot),q^d} \le \zeta$. Hence, all that remains is to show that $Q$ is large, i.e. for any
$0 < \omega < 1$, $\lim_{d\to\infty} \frac{Volume([0,\omega]^d)}{Volume(Q^d)} = 0$.

Let $Y$ be distributed as $f_d$ and be independent of $Y_1, \ldots, Y_{n(d)}$. Define random variables
$D_{min} = \min_{i=1}^{n(d)} \{\|Y - Y_i\|_p\}$, $D_{max} = \max_{i=1}^{n(d)} \{\|Y - Y_i\|_p\}$. Such random variables
(or related ones) have received considerable study in the stochastic geometry literature.
Using one such study we prove, in Appendix A, the following two inequalities with $Z$
denoting $D_{max} - (1+\epsilon)D_{min}$:

$$\lim_{d\to\infty} \frac{E[Z]}{d^{1/p}} \ge \frac{1}{100} \quad \text{and} \quad Volume(Q^d) \ge \frac{1}{\zeta\beta(d)} \left( \frac{E[Z]}{d^{1/p}} + \zeta - 1 \right). \qquad (2)$$



For any $0 < \omega < 1$, inequalities (2) as well as the assumptions that $99/100 < \zeta < 1$ and
$\beta(d)$ grows sub-exponentially imply that $\lim_{d\to\infty} \frac{Volume([0,\omega]^d)}{Volume(Q^d)} = 0$, as needed.
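
The last step can be illustrated numerically. Rearranging the right inequality in (2) gives $Volume([0,\omega]^d)/Volume(Q^d) \le \omega^d \zeta\beta(d) / (c + \zeta - 1)$, where $c$ stands in for $E[Z]/d^{1/p}$. The sketch below works in logarithms to avoid underflow; the function name, the default $c = 1/100$ (the limiting lower bound), and the polynomial $\beta$ used in the example are assumptions made for illustration.

```python
import math


def log_volume_ratio_bound(d, omega, zeta, log_beta, c=0.01):
    """Natural log of omega^d * zeta * beta(d) / (c + zeta - 1).

    log_beta is a caller-supplied function returning log(beta(d));
    c is a stand-in for E[Z] / d^(1/p).
    """
    slack = c + zeta - 1.0
    if slack <= 0.0:
        raise ValueError("need c + zeta - 1 > 0")
    return d * math.log(omega) + math.log(zeta) + log_beta(d) - math.log(slack)


if __name__ == "__main__":
    # Hypothetical sub-exponential bound beta(d) = d^3.
    cubic = lambda d: 3.0 * math.log(d)
    for d in (100, 1000, 10000):
        print(d, log_volume_ratio_bound(d, 0.99, 0.995, cubic))
```

Because $d \log\omega$ decreases linearly while $\log\beta(d)$ grows slower than linearly, the logarithm of the bound eventually tends to $-\infty$ for any fixed $\omega < 1$.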



4. Multi-Variate Gaussian Distributions Preliminary Results 

We provide preliminary results concerning instability admission over an important
class of distributions that do not satisfy our assumptions above: $\{N(\mu_d, \Sigma_d)\}$. The following
simple strategy yields a sufficient condition in the case that: $q^d = 0$, $\mu_d = 0$,
$p = 2$, and the number of eigenvalues of $\Sigma_d$ which do not go to zero grows faster than
$n(d)$. Using the eigenvalue decomposition of $\Sigma_d$, it can be shown that



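$$\Pr\left[ \left| \|Y_1\|_2^2 - E[\|Y_1\|_2^2] \right| > E[\|Y_1\|_2^2]\,\delta(\epsilon,2) \right]
= \Pr\left[ \left| \sum_{j=1}^{d} W_j^2 - E\left[ \sum_{j=1}^{d} W_j^2 \right] \right| > E\left[ \sum_{j=1}^{d} W_j^2 \right] \delta(\epsilon,2) \right]$$

where the $W$'s are independent and distributed as $N(0, \lambda_j)$ with $\lambda_j$ the $j$th largest eigenvalue
of $\Sigma_d$. Chebyshev's inequality shows that the r.h.s. of the equation above is bounded
above by

$$\frac{2 \sum_{j=1}^{d} \lambda_j^2}{\delta(\epsilon,2)^2 \left( \sum_{j=1}^{d} \lambda_j \right)^2}.$$

Plugging this bound into the r.h.s. of inequality (1), with $\gamma = E[\|Y_1\|_2^2]$, our assumptions
above on $n(\cdot)$ and the $\lambda$'s imply that $\lim_{d\to\infty} P_{d,n(\cdot),q^d} = 1$.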
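
The Chebyshev step can be sanity-checked by simulation. For independent $W_j \sim N(0, \lambda_j)$, the sum $\sum_j W_j^2$ has mean $\sum_j \lambda_j$ and variance $2\sum_j \lambda_j^2$, so Chebyshev's inequality gives the tail bound $2\sum_j \lambda_j^2 / [\delta(\epsilon,2)^2 (\sum_j \lambda_j)^2]$. The sketch below uses hypothetical function names and arbitrary eigenvalues; it is an illustration, not a substitute for the argument.

```python
import math
import random


def delta(eps, p):
    """delta(eps, p) = [(1 + eps)^p - 1] / [(1 + eps)^p + 1]."""
    a = (1.0 + eps) ** p
    return (a - 1.0) / (a + 1.0)


def chebyshev_tail(eigenvalues, eps):
    """Chebyshev bound on Pr[|S - E[S]| > E[S] * delta(eps, 2)],
    where S = sum_j W_j^2 and W_j ~ N(0, eigenvalues[j])."""
    s1 = sum(eigenvalues)
    s2 = sum(lam * lam for lam in eigenvalues)
    return 2.0 * s2 / (delta(eps, 2.0) ** 2 * s1 ** 2)


def simulated_tail(eigenvalues, eps, trials=2000, seed=0):
    """Monte Carlo estimate of the same tail probability."""
    rng = random.Random(seed)
    mean = sum(eigenvalues)
    threshold = mean * delta(eps, 2.0)
    hits = 0
    for _ in range(trials):
        # random.gauss takes a standard deviation, hence the square root.
        s = sum(rng.gauss(0.0, math.sqrt(lam)) ** 2 for lam in eigenvalues)
        if abs(s - mean) > threshold:
            hits += 1
    return hits / trials
```

With $m$ equal eigenvalues the bound reduces to $2/[\delta(\epsilon,2)^2 m]$, so it shrinks only at rate $1/m$; this is why the number of non-vanishing eigenvalues has to outpace the dataset size for the bound in (1) to approach one.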

Extending the above strategy to $q^d \ne 0$, $\mu_d \ne 0$ and larger growth rates for $n(\cdot)$ seems
possible utilizing more complex properties of weighted, non-central chi-square distributions.
However, extending beyond $p = 2$ seems difficult as only the 2-norm is preserved
by orthogonal transformations. Also, extending beyond multi-variate Gaussian data distributions
seems difficult owing to the fact that independence of the $W$'s depends upon
the Gaussian assumption.

A. Appendix: Some Proofs 

First we prove the left inequality in (2): $\lim_{d\to\infty} \frac{E[Z]}{d^{1/p}} \ge \frac{1}{100}$, where $Z = D_{max} - (1+\epsilon)D_{min}
= \max_{i=1}^{n(d)} \{\|Y - Y_i\|_p\} - (1+\epsilon) \min_{i=1}^{n(d)} \{\|Y - Y_i\|_p\}$.

Theorems 1.1 and 1.2 of [9] produce an upper-bound on $E[D_{min}]$ and a lower-bound
on $E[D_{max}]$, respectively. These combine to yield⁵



E[Z] T(n(d) + l/d)T(n(d) + 1) 



d 1/p dVpr(n(d))r(n(d) + l + l/d)3V2 2 VVV2rf)||/ d ||2/^/ d 

2(1 + e) / 1 + e 



Thus 







dV?(n(d) + iy/ d vV d \d 1/p {n(d) + 



$$\lim_{d\to\infty} \frac{E[Z]}{d^{1/p}} \ge \lim_{d\to\infty} \left( \frac{1}{3^{1/2} d^{1/p} V_{d,p}^{1/d}} - \frac{1}{2 d^{1/p} V_{d,p}^{1/d}} \right).$$
From [8] (using the fact that $p > 1$) and Stirling's approximation⁷ of $\Gamma(\cdot)$ (6.1.37 in [1]),
$\lim_{d\to\infty} d^{1/p} V_{d,p}^{1/d} \le 2(ep)^{1/p}$. Hence, the above limit is bounded below by (1/100), as
desired.
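
The two numerical facts used at the end of this proof can be checked directly. The sketch below relies on the standard closed form for the volume of the unit $p$-norm ball, $V_{d,p} = (2\Gamma(1 + 1/p))^d / \Gamma(1 + d/p)$, which is assumed here rather than taken from [8]; the function names are hypothetical.

```python
import math


def log_unit_ball_volume(d, p):
    """log V_{d,p}, with V_{d,p} = (2 * Gamma(1 + 1/p))^d / Gamma(1 + d/p)."""
    return d * (math.log(2.0) + math.lgamma(1.0 + 1.0 / p)) - math.lgamma(1.0 + d / p)


def scaled_root_volume(d, p):
    """d^(1/p) * V_{d,p}^(1/d), computed in log space for large d."""
    return math.exp(math.log(d) / p + log_unit_ball_volume(d, p) / d)


def limit_bound(p):
    """The claimed bound 2 * (e * p)^(1/p) on the limit of scaled_root_volume."""
    return 2.0 * (math.e * p) ** (1.0 / p)


def final_lower_bound(p):
    """(1/sqrt(3) - 1/2) / (2 * (e * p)^(1/p)), to be compared with 1/100."""
    return (1.0 / math.sqrt(3.0) - 0.5) / limit_bound(p)


if __name__ == "__main__":
    for p in (1.0, 2.0, 5.0):
        print(p, scaled_root_volume(10 ** 5, p), limit_bound(p), final_lower_bound(p))
```

Since $(ep)^{1/p}$ is largest at $p = 1$ among $p \ge 1$, the tightest case for the comparison with 1/100 is `final_lower_bound(1.0)`.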



⁵ $V_{d,p}$ denotes the volume of the unit-ball in $\mathbb{R}^d$ with respect to the $p$-norm. $\Gamma(\cdot)$ denotes the standard
gamma function.

⁶ $\lim_{d\to\infty} \|f_d\|_2^{2/d} = 1$ since support($f_d$) = $[0,1]^d$ and sequence $\{f_d\}$ is bounded above
sub-exponentially. Also, the ratio of the $\Gamma(\cdot)$'s approaches one because of the equality $\Gamma(z+1) = z\Gamma(z)$
for any $z \in \mathbb{R}$. Finally, $\lim_{d\to\infty} (n(d) + 1)^{1/d} \ge 4(1+\epsilon)$ since, by assumption, $n(d) > [4(1+\epsilon)]^d$ for large
$d$.

⁷ For large $z$, $\Gamma(z) \approx \exp(-z) z^{z-1/2} (2\pi)^{1/2}$.

Now we prove the right inequality in (2): $Volume(Q^d) \ge \frac{1}{\zeta\beta(d)} \left( \frac{E[Z]}{d^{1/p}} + \zeta - 1 \right)$, where
$Q^d = \{q^d \in [0,1]^d : \Pr[\max_{i=1}^{n(d)} \|q^d - Y_i\|_p - (1+\epsilon) \min_{i=1}^{n(d)} \|q^d - Y_i\|_p > 0] > 1 - \zeta\}$ and
$99/100 < \zeta < 1$.

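Let $f_Z$ and $f_{Z|Y}$ denote the p.d.f of $Z$ and the conditional p.d.f of $Z$ given $Y$,
respectively. Since support($f_d$) = $[0,1]^d$, then support($f_Z$) $\subseteq (-\infty, d^{1/p}]$; thus,
$E[Z] = \int z f_Z(z)\,dz \le d^{1/p} \int_{z=0}^{d^{1/p}} f_Z(z)\,dz
= d^{1/p} \int_{z=0}^{d^{1/p}} \int_{q^d \in [0,1]^d} f_{Z|Y}(z|q^d) f_d(q^d)\,dq^d\,dz
= d^{1/p} \int_{q^d \in [0,1]^d} \int_{z=0}^{d^{1/p}} f_{Z|Y}(z|q^d) f_d(q^d)\,dz\,dq^d$. Hence,

$$\begin{aligned}
\frac{E[Z]}{d^{1/p}} &\le \int_{q^d \in [0,1]^d} \int_{z=0}^{d^{1/p}} f_{Z|Y}(z|q^d) f_d(q^d)\,dz\,dq^d \\
&= \int_{q^d \in [0,1]^d} f_d(q^d) \Pr\left[ \max_{i=1}^{n(d)} \{\|q^d - Y_i\|_p\} - (1+\epsilon) \min_{i=1}^{n(d)} \{\|q^d - Y_i\|_p\} > 0 \right] dq^d \\
&\le \int_{q^d \in Q^d} f_d(q^d)\,dq^d + (1 - \zeta) \int_{q^d \in ([0,1]^d \setminus Q^d)} f_d(q^d)\,dq^d \\
&= \Pr[Y \in Q^d] + (1 - \zeta) \Pr[Y \in ([0,1]^d \setminus Q^d)] \\
&= \zeta \Pr[Y \in Q^d] + 1 - \zeta \\
&\le \zeta\beta(d)\,Volume(Q^d) + 1 - \zeta.
\end{aligned}$$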

The second inequality follows from the definition of $Q^d$ and the last inequality follows
from the assumption that $f_d$ is bounded above sub-exponentially. The desired inequality
follows.